
Tracing fix #128

Merged: 11 commits, Jan 31, 2024
Conversation

@sarsko (Contributor) commented Nov 27, 2023

I want to add a bit more testing. At least something fairly minimal, like connecting the tests to a tracing_subscriber::fmt which writes somewhere other than stdout and then checking that output, and a simple "X threads enter and exit Y spans and assert that the current span is what they expect" test.

Opening a PR in order to get feedback and suggestions. Feel free to suggest alternative test ideas. It would be nice to do a programmatic check against Tokio / no runtime as well.
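
For the "write somewhere other than stdout and check on it" idea, here is a minimal sketch of what that could look like. The BufWriter type, the test body, and the assertion are hypothetical; only tracing_subscriber's MakeWriter plumbing and shuttle::check_random are the real APIs:

use std::io::Write;
use std::sync::{Arc, Mutex};
use tracing_subscriber::fmt::MakeWriter;

// A clonable writer that appends everything to a shared buffer
// instead of stdout, so the test can assert on the captured traces.
#[derive(Clone, Default)]
struct BufWriter(Arc<Mutex<Vec<u8>>>);

impl Write for BufWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

impl<'a> MakeWriter<'a> for BufWriter {
    type Writer = BufWriter;
    fn make_writer(&'a self) -> Self::Writer {
        self.clone()
    }
}

#[test]
fn traces_go_to_buffer() {
    let buf = BufWriter::default();
    let _ = tracing_subscriber::fmt().with_writer(buf.clone()).try_init();
    shuttle::check_random(|| { /* hypothetical test body */ }, 1);
    let output = String::from_utf8(buf.0.lock().unwrap().clone()).unwrap();
    assert!(output.contains("step")); // check on the captured output
}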

Fixing tracing

I think I have done it right this time. There is one known divergence from Tokio (listed below); there might be more, but I think the general scheme should work. This also adds a fix for spans stacking on test panic inside a catch_unwind (read: proptest minimization).

Overview of the changes

  • New config option, record_steps_in_span.
    • The step{i=X, task=Y} span is now step{task=Y} by default, and step{task=Y, i=X} if record_steps_in_span is true
  • Updating the tracing span is no longer done in advance_to_next_task. It is instead done in step, by entering a Task's span before running its continuation, and leaving the span when we return from the continuation.
  • Added ResetSpanOnDrop, which is constructed in run; a sketch of the pattern follows this list. This solves the issue of span stacking on panics inside a catch_unwind (read: proptest minimization). I found this way of solving it much cleaner than finding all the places we could exit a Shuttle test and making sure to clean up spans there.
  • A couple of traces which were previously emitted in whatever span was present (i.e. the span of a Task) are now emitted in (an extension of) Shuttle's span.
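
For readers unfamiliar with the pattern: ResetSpanOnDrop is an RAII guard. The sketch below shows only the shape, under the assumption that exiting a span through the Dispatch pops the subscriber's span stack; it is not the PR's actual implementation, and the names are illustrative:

use tracing::{dispatcher, Span};

// Remembers the span that was current when the guard was created, and
// on drop exits any spans stacked on top of it. Because drop also runs
// during panic unwinding, this cleans up spans that a panicking test
// entered but never exited (e.g. under catch_unwind).
struct ResetSpanOnDrop {
    span_on_entry: Span,
}

impl ResetSpanOnDrop {
    fn new() -> Self {
        Self { span_on_entry: Span::current() }
    }
}

impl Drop for ResetSpanOnDrop {
    fn drop(&mut self) {
        dispatcher::get_default(|dispatch| {
            // Pop spans until the one stored at construction is current again.
            while Span::current().id() != self.span_on_entry.id() {
                match Span::current().id() {
                    Some(id) => dispatch.exit(&id),
                    None => break,
                }
            }
        });
    }
}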

Some notes

  • I mentioned that I considered a "default to no span" approach, meaning that Shuttle would exit whatever span was present on test entry, store it, and then follow an explicit enter/leave policy for all tracing, thus avoiding all accidental "span-bleed". I decided against this because ensuring that the correct span was entered, for instance in thread_fn, was cumbersome and verbose, and thus more error-prone than using whatever span is present. This means there may be traces emitted under a span other than the one we would prefer (as I have not looked into every single Event emission); these can be handled as we find them.
  • The in_scope calls in spawn_thread and Task::new() are a matter of taste: I prefer for these to be emitted under Shuttle's span, but they are reactions to actions taken by the function under test, so there is an argument for emitting them under whatever span is currently entered.

Known issues / discrepancies

Below are some tests which do the same thing using Shuttle sync/future threads, std::sync threads, and tokio::spawn tasks. Shuttle sync/future and std::sync behave the same; tokio::spawn does something different. I think what Shuttle currently does is sane enough (arguably better than the mess you may get when running under Tokio; this is not how to combine futures and spans).

Tokio

async fn awaiting_is_relevant() {
    let lock = Arc::new(Mutex::new(0));
    let threads: Vec<_> = (0..3).map(|i| {
        let lock = lock.clone();
        tokio::spawn(async move {
            let outer = warn_span!("outer", tid=i);
            let _outer = outer.enter();

            let lock_clone = lock.clone();
            let t = tokio::spawn(async move {
                let mut locked = lock_clone.lock().unwrap();
                let inner = warn_span!("inner1", tid=i);
                let _inner = inner.enter();
                error!("incrementing from {}", *locked);
                *locked += 1;
            });

            t.await.unwrap();

            let _t = tokio::spawn(async move {
                let mut locked = lock.lock().unwrap();
                let inner = warn_span!("inner2", tid=i);
                let _inner = inner.enter();
                error!("incrementing from {}", *locked);
                *locked += 1;
            });

        })
    }).collect();

    for thread in threads {
        thread.await.unwrap();
    }
}

#[tokio::test]
async fn tokio_awaiting_is_relevant() {
    let _ = tracing_subscriber::fmt::try_init();
    awaiting_is_relevant().await;
}

Running RUST_LOG=warn cargo test tokio_awaiting gives:

running 1 test
2023-11-27T16:33:30.579493Z ERROR outer{tid=0}:outer{tid=1}:outer{tid=2}:inner1{tid=0}: play: incrementing from 0
2023-11-27T16:33:30.579761Z ERROR outer{tid=0}:outer{tid=1}:outer{tid=2}:inner1{tid=1}: play: incrementing from 1
2023-11-27T16:33:30.579794Z ERROR outer{tid=0}:outer{tid=1}:outer{tid=2}:inner1{tid=2}: play: incrementing from 2
2023-11-27T16:33:30.579850Z ERROR inner2{tid=0}: play: incrementing from 3
2023-11-27T16:33:30.579869Z ERROR inner2{tid=1}: play: incrementing from 4
2023-11-27T16:33:30.579885Z ERROR inner2{tid=2}: play: incrementing from 5
test tokio_awaiting_is_relevant ... ok

As we can see, doing t.await.unwrap(); for the first t makes it execute under the parent spans, while not awaiting the second t leaves its events with no parent.
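
(For reference, the documented way to give a spawned task a parent span is to attach the span with tracing's Instrument combinator instead of holding an enter() guard across the spawn. A minimal sketch, reusing the names from the example above:)

use tracing::Instrument;

let span = warn_span!("inner1", tid = i);
let t = tokio::spawn(
    async move {
        // Every poll of this future runs inside `span`, regardless of
        // which worker thread polls it or whether anyone awaits `t`.
    }
    .instrument(span),
);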

Shuttle future::spawn

async fn shuttle_awaiting_is_irrelevant() {
    let lock = Arc::new(Mutex::new(0));
    let threads: Vec<_> = (0..3)
        .map(|i| {
            let lock = lock.clone();
            shuttle::future::spawn(async move {
                let outer = warn_span!("outer", tid = i);
                let _outer = outer.enter();

                let lock_clone = lock.clone();
                let t = shuttle::future::spawn(async move {
                    let mut locked = lock_clone.lock().unwrap();
                    let inner = warn_span!("inner1", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });

                t.await.unwrap();

                let _t = shuttle::future::spawn(async move {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner2", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });
            })
        })
        .collect();

    for thread in threads {
        thread.await.unwrap();
    }
}

#[test]
fn shuttle_test_awaiting_is_irrelevant() {
    let _ = tracing_subscriber::fmt::try_init();
    shuttle::check_random(
        move || {
            shuttle::future::block_on(async {
                shuttle_awaiting_is_irrelevant().await;
            })
        },
        1,
    );
}

Running RUST_LOG=warn cargo test shuttle_test_awaiting gives:

running 1 test
2023-11-27T16:45:55.192800Z ERROR inner1{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-11-27T16:45:55.194048Z ERROR inner1{tid=1}: r#mod::basic::tracing: incrementing from 1
2023-11-27T16:45:55.199122Z ERROR inner2{tid=0}: r#mod::basic::tracing: incrementing from 2
2023-11-27T16:45:55.200569Z ERROR inner2{tid=1}: r#mod::basic::tracing: incrementing from 3
2023-11-27T16:45:55.200632Z ERROR inner1{tid=2}: r#mod::basic::tracing: incrementing from 4
2023-11-27T16:45:55.200699Z ERROR inner2{tid=2}: r#mod::basic::tracing: incrementing from 5
test basic::tracing::shuttle_test_awaiting_is_irrelevant ... ok

As we can see, this differs from what we saw when running under Tokio.

No runtime std::thread::spawn

#[test]
fn std_thread_joining_is_irrelevant() {
    let _ = tracing_subscriber::fmt::try_init();
    let lock = Arc::new(std::sync::Mutex::new(0));
    let threads: Vec<_> = (0..3)
        .map(|i| {
            let lock = lock.clone();
            std::thread::spawn(move || {
                let outer = warn_span!("outer", tid = i);
                let _outer = outer.enter();

                let lock_clone = lock.clone();
                let t = std::thread::spawn(move || {
                    let mut locked = lock_clone.lock().unwrap();
                    let inner = warn_span!("inner1", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });

                t.join().unwrap();

                let _t = std::thread::spawn(move || {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner2", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });
            })
        })
        .collect();

    for thread in threads {
        thread.join().unwrap();
    }
}

Running RUST_LOG=warn cargo test std_thread -- --nocapture gives:

running 1 test
2023-11-27T17:38:49.695078Z ERROR inner1{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-11-27T17:38:49.695650Z ERROR inner1{tid=1}: r#mod::basic::tracing: incrementing from 1
2023-11-27T17:38:49.696859Z ERROR inner1{tid=2}: r#mod::basic::tracing: incrementing from 2
2023-11-27T17:38:49.697743Z ERROR inner2{tid=0}: r#mod::basic::tracing: incrementing from 3
test basic::tracing::std_thread_joining_is_irrelevant ... ok
2023-11-27T17:38:49.697917Z ERROR inner2{tid=1}: r#mod::basic::tracing: incrementing from 4

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 

As we can see, the same as above.

Shuttle thread::spawn

fn shuttle_joining_is_irrelevant() {
    let lock = Arc::new(Mutex::new(0));
    let threads: Vec<_> = (0..3)
        .map(|i| {
            let lock = lock.clone();
            thread::spawn(move || {
                let outer = warn_span!("outer", tid = i);
                let _outer = outer.enter();

                let lock_clone = lock.clone();
                let t = thread::spawn(move || {
                    let mut locked = lock_clone.lock().unwrap();
                    let inner = warn_span!("inner1", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });

                t.join().unwrap();

                let _t = thread::spawn(move || {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner2", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });
            })
        })
        .collect();

    for thread in threads {
        thread.join().unwrap();
    }
}

#[test]
fn shuttle_test_joining_is_irrelevant() {
    let _ = tracing_subscriber::fmt::try_init();
    shuttle::check_random(shuttle_joining_is_irrelevant, 1);
}

Running RUST_LOG=warn cargo test shuttle_test_join -- --nocapture gives:

running 1 test
2023-11-27T17:40:09.863872Z ERROR inner1{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-11-27T17:40:09.876532Z ERROR inner2{tid=0}: r#mod::basic::tracing: incrementing from 1
2023-11-27T17:40:09.878446Z ERROR inner1{tid=2}: r#mod::basic::tracing: incrementing from 2
2023-11-27T17:40:09.878521Z ERROR inner1{tid=1}: r#mod::basic::tracing: incrementing from 3
2023-11-27T17:40:09.878573Z ERROR inner2{tid=2}: r#mod::basic::tracing: incrementing from 4
2023-11-27T17:40:09.878639Z ERROR inner2{tid=1}: r#mod::basic::tracing: incrementing from 5
test basic::tracing::shuttle_test_joining_is_irrelevant ... ok

As we can see, the same as the two above.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sarsko (Contributor Author) commented Nov 28, 2023

Issue when running with RUST_LOG=info:

tried to clone Id(195906583790617176), but no span exists with that ID
This may be caused by consuming a parent span (`parent: span`) rather than borrowing it (`parent: &span`).
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscriber>::clone_span
   3: <tracing_subscriber::fmt::Subscriber<N,E,F,W> as tracing_core::subscriber::Subscriber>::enter
   4: amzn_shuttle_fork::runtime::execution::Execution::step::{{closure}}::{{closure}}
   5: amzn_shuttle_fork::runtime::execution::Execution::step
   6: scoped_tls::ScopedKey<T>::set
   7: amzn_shuttle_fork::runtime::execution::Execution::run
   8: tracing::span::Span::in_scope
   9: scoped_tls::ScopedKey<T>::set
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@sarsko (Contributor Author) commented Nov 28, 2023

There is a question of whether the step and execution spans should be at the ERROR level so that they are always included. Having that information is pretty much always useful.

@jorajeev (Member) left a comment

Thanks for the CR - this will make debugging a lot easier, especially when using Shuttle with proptesting. And the *_awaiting_is_irrelevant examples are really helpful!

src/lib.rs Outdated
/// `tracing_subscriber::fmt`, appends to the span on calls to `record()` (instead of
/// overwriting), which results in traces which are hard to read if the task is scheduled more
/// than a few times.
/// Thus: set `record_steps_in_span` to `true` if you want this behaviour, or if you are using
jorajeev (Member): By this behavior you mean the append behavior, right? Could you make that clear?

jorajeev (Member): Also: do we really need to provide this option? What other format subscribers are there for which we'd want to set this to true?

@sarsko (Contributor Author) replied Dec 7, 2023:

> By this behavior you mean append behavior right? Could you make that clear?

By "this behaviour" I mean that we call record() on the step_span. In tracing_subscriber::fmt this maps to appending. In The Nice Subscriber this maps to overwriting.

> Also: do we really need to provide this option? What other format subscribers are there for which we'd want to set this to true?

Having the step logged is super useful. I think this will be set to true for all my tests. I'll make a tracing_subscriber::fmt::Layer which handles this.
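
(For context, enabling the new option would look roughly like this. The record_steps_in_span field is the one added in this PR; the rest follows Shuttle's existing Config/Runner API, and the test body is hypothetical:)

use shuttle::scheduler::RandomScheduler;
use shuttle::{Config, Runner};

let mut config = Config::new();
config.record_steps_in_span = true; // step{task=Y, i=X} instead of step{task=Y}

let scheduler = RandomScheduler::new(100); // 100 iterations
Runner::new(scheduler, config).run(|| {
    // test body under Shuttle
});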

if let Some(span_id) = current_span.id().as_ref() {
    subscriber.exit(span_id);
}
state.current_mut().span = current_span;
jorajeev (Member): This makes me wonder if the field should be renamed from span to last_span?

@sarsko (Contributor Author): Not super important what it is named, but why? The span hasn't changed; we have just swapped out of the task. The span is the Span of the task, just as the continuation is its continuation.

jorajeev (Member), on the new tests: Since these tests are not checking anything, maybe mark them ignored for now?

@sarsko (Contributor Author): Yes. I'd like to add a Subscriber which checks them. Can ignore for now, then add that later (I'll open an issue).

@sarsko (Contributor Author) commented Dec 13, 2023

Never mind: I did not look into the non-ERROR/WARN traces sufficiently. There is span-stacking on instrumented futures.

@sarsko (Contributor Author) commented Jan 6, 2024

Interestingly, the issues with instrumented futures don't show up when running with tracing 0.2.0 (the current pre-release version).

tracing::info_span!(parent: parent_span.id(), "step", i = schedule_len, task = id.0)
};
let step_span = error_span!(parent: parent_span_id.clone(), "step", task = id.0, i = field::Empty);
// Note that this is slightly lazy — we are starting storing at the step_span, but could have gotten the
jorajeev (Member): Hm, if we know that storing the full stack would be better, why are we cutting corners here? Why not just do the right thing?

@sarsko (Contributor Author): Because I don't think it is better / I don't believe this is cutting corners. This comment and the other one of a similar nature are meant as the following: "I really believe the current scheme is correct. However, spans are both weird and tricky. If there ever are any issues, then start by doing the changes in the comments, in order to definitely rule them out." In other words: I think the current way of doing it is "just right", and that the suggestions are "too much". I'd rather do the "just right" approach, both for overall code cleanness and for the opportunity to be proven wrong (and thus learn more about Spans), but leave the comments in case I am wrong.

// All of the tests testing tracing will either have to have weak assertions,
// or be #[ignore]d. The reason for this is that they are not thread safe
// (since everything tracing is managed globally), and there is no way to
// ensure that the tests are run single-threaded.
jorajeev (Member): There is this crate: https://docs.rs/serial_test/latest/serial_test/index.html. See if it's worth using here?

@sarsko (Contributor Author): Will do, thanks!

@sarsko (Contributor Author): It seems to me that it won't solve the issue unless all other tests are marked with parallel, which isn't worth it. This is because marking a test with serial only guarantees that it won't be run at the same time as other serial or parallel tests; it can still run at the same time as tests which are not marked.
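
(For reference, serial_test works by tagging individual tests, which is why it only serializes tests that opt in. A minimal sketch of the suggested usage, with hypothetical test names:)

use serial_test::serial;

#[test]
#[serial] // mutually exclusive with other #[serial] tests, but not with unmarked ones
fn tracing_test_a() { /* ... */ }

#[test]
#[serial]
fn tracing_test_b() { /* ... */ }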

@jorajeev merged commit d86479b into awslabs:main Jan 31, 2024
5 checks passed
jorajeev pushed a commit that referenced this pull request Feb 29, 2024
* Tracing fix

* Own Span in Task to avoid dropping the Span while Task is switched out

* Respond to comments by @jorajeev

- current_task -> parent_task_id
- span -> top_level_span
- parent_id -> parent_span_id
- #[ignore] tracing tests
- move test_tracing_nested_spans up

* Make the execution_span and the step_span ERROR

* Add instrumented futures test

* Fix issue with tracing and instrumented futures

* Update instrumented futures tracing test to highlight top level span failures

* Clippy unneeded async

* Update ResetSpanOnDrop to store an owned copy of the Span

* Address comments by @jorajeev